What is the unemployment rate?

First, we must know what is the definition of the unemployment rate to understand how it is derived and what factors are related to it. A person is defined as unemployed in the United States if they are jobless, but have looked for work in the last four weeks and are available for work. To record unemployed, Government distributes survey to sampling population and predict the entire unemployed number in a broad area. Measuring the unemployment gives us a good overview of the ongoing status of the economy, international competition, technology development, and so on.

The equation of the Unemployment Rate is

\[unemployment \space rate = \frac{unemployed}{labor \space force}\times100\] where labor force includes all people age 16 and older who are classified as either employed and unemployed.

In this project, I will focus on analyzing and predicting the unemployment rate in the LA County.

Data Cleaning

unprocessed = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/unprocessed_data.csv")
head(unprocessed)
##       DATE unemploy_rate_la avg_price_pipedgas_la avg_price_electr_kwh_La
## 1 1990/1/1              5.9                18.662                   0.105
## 2 1990/2/1              5.6                19.971                   0.108
## 3 1990/3/1              5.4                19.971                   0.108
## 4 1990/4/1              5.5                19.976                   0.108
## 5 1990/5/1              5.4                27.721                   0.107
## 6 1990/6/1              5.4                27.712                   0.108
##   avg_price_gasolone_la civilian_labor_force_la_pch cpi_allitems_la
## 1                 0.957                        #N/A             1.1
## 2                 0.988                         0.2             1.1
## 3                 1.014                         0.1             0.7
## 4                 1.030                        -0.4            -0.2
## 5                 1.080                         0.4             0.3
## 6                 1.103                           0             0.3
##   economics_cond_index_la unemployed_num_pch
## 1                      NA               #N/A
## 2                    1.24               -4.3
## 3                    0.58               -3.8
## 4                   -0.35                  1
## 5                   -0.42               -0.9
## 6                   -0.64               -0.5
##   new_private_housing_structure_issue_la home_price_index_la
## 1                               5121.609                 0.7
## 2                               4648.972                 0.4
## 3                               3628.443                 0.2
## 4                               3833.476                -0.1
## 5                               3466.321                -1.1
## 6                               3496.684                -0.3
##   allemployee_nonfarm_la_pch allemployee_constr_la_pch allemployee_manu_la_pch
## 1                       #N/A                      #N/A                    #N/A
## 2                        0.1                      -1.2                       0
## 3                       -0.1                      -1.1                    -0.4
## 4                       -0.2                      -3.7                    -0.4
## 5                       -0.2                      -1.3                    -0.4
## 6                       -0.3                      -0.9                    -0.7
##   allemployee_finan_la_pch allemployee_leisure_la_pch new_patent_ass_la
## 1                     #N/A                       #N/A               202
## 2                     -0.1                       -0.4               218
## 3                     -0.7                       -0.1               290
## 4                     -0.2                       -0.6               256
## 5                     -0.8                       -0.1               262
## 6                     -0.3                        0.3               242
##   govn_social_insu_pch compen_employee_wage_pch real_disp_inc_per_capital_pch
## 1                  1.6                      0.5                           0.5
## 2                  0.0                      1.2                           0.1
## 3                  1.0                      0.7                          -0.1
## 4                  0.0                      0.9                           0.5
## 5                  0.3                     -0.3                          -0.2
## 6                  1.3                      0.8                           0.0
##   bbk_real_gdp us_interest_rate pers_consum_expen_pch pers_saving_rate
## 1    4.8776608                7                   1.3              8.0
## 2    6.1814509                7                  -0.1              8.6
## 3    2.9195562                7                   0.7              8.3
## 4   -0.5634379                7                   0.4              8.8
## 5    0.7507924                7                   0.2              8.7
## 6    1.1771073                7                   0.8              8.6
##   pers_current_tax_chg govn_social_ben_toperson_pch federal_fund_eff_rate
## 1                 -8.4                          5.3              8.229032
## 2                  8.1                         -0.2              8.237143
## 3                  5.2                          0.7              8.276774
## 4                  4.2                          0.6              8.255000
## 5                  0.6                         -0.5              8.176452
## 6                  3.9                          1.2              8.288667
##   X30_year_fixed_mortgage
## 1                 1.54967
## 2                 3.05710
## 3                 0.69135
## 4                 0.99338
## 5                 1.03664
## 6                -2.99213

Although all variables are supposed to be numeric, but in fact some of them are imported as character variables. Also, we need to deal with missing value in columns contained “PCH” which means “Percent Change”.

unprocessed = unprocessed[-1,]
date = unprocessed[,1]

# delete NA in the first row 
unprocessed = unprocessed[,-1] %>% mutate_if(is.character, as.numeric)

unprocessed$DATE = date
unprocessed = unprocessed %>%
  select(DATE, everything())

# delete variables that have at least 8 missing values 
processed_data = unprocessed %>% select(-avg_price_pipedgas_la, -new_patent_ass_la,
                                     -cpi_allitems_la, -us_interest_rate,
                                     -economics_cond_index_la) %>% 
  head(-2)

sapply(processed_data, function(x) sum(is.na(x)))
##                                   DATE                       unemploy_rate_la 
##                                      0                                      0 
##                avg_price_electr_kwh_La                  avg_price_gasolone_la 
##                                      0                                      0 
##            civilian_labor_force_la_pch                     unemployed_num_pch 
##                                      0                                      0 
## new_private_housing_structure_issue_la                    home_price_index_la 
##                                      0                                      0 
##             allemployee_nonfarm_la_pch              allemployee_constr_la_pch 
##                                      0                                      0 
##                allemployee_manu_la_pch               allemployee_finan_la_pch 
##                                      0                                      0 
##             allemployee_leisure_la_pch                   govn_social_insu_pch 
##                                      0                                      0 
##               compen_employee_wage_pch          real_disp_inc_per_capital_pch 
##                                      0                                      0 
##                           bbk_real_gdp                  pers_consum_expen_pch 
##                                      0                                      0 
##                       pers_saving_rate                   pers_current_tax_chg 
##                                      0                                      0 
##           govn_social_ben_toperson_pch                  federal_fund_eff_rate 
##                                      0                                      0 
##                X30_year_fixed_mortgage 
##                                      0
write.csv(processed_data, "D:\\UCSB\\Spring_2022\\PSTAT 131\\PSTAT_131_HW\\HW2\\PSTAT-131\\Final Project\\data\\processed_data.csv", row.names = FALSE)

EDA

# start with Feb since we delete the first row of unprocessed data
processed = ts(unprocessed[,-1],frequency = 12, start = c(1990,2)) 

autoplot.zoo(processed[,"unemploy_rate_la"])+
  ggtitle("Unemployment Rate in LA County") +
  xlab("Year") +
  ylab("Percentage%")

We discovered that there are a few well-known recession periods from 1990 to now. The collapse of internet bubble, the financial crisis of 2007, and Covid-19 pandemic all matches severe increase of the unemployment rate. We may study how different economics indexes fluctuates during the financial crisis of 2007 which is mainly caused by the mortgage debt.

Now we want to discover the seasonal pattern of the unemployment. Except the fluctuation during the Covid-19 seems abnormal, we discover that the unemployment rate usually peaks in summer. This is a problem requires further research.

ggseasonplot(processed[,"unemploy_rate_la"]) +
  ggtitle("Seasonal Plot of Unemployment Rate in LA County") +
  xlab("Year") +
  ylab("Percentage%")

ggsubseriesplot(processed[,"unemploy_rate_la"]) +
  ylab("$ million") +
  ggtitle("Seasonal subseries plot: LA Unemployment Rate")

Spatial Analysis

Net Migration

census_api_key("7540e4d61b8467521425225cbe8f44f7c1667f9a")
net_migration <- get_estimates(geography = "county", state = "CA",
                               variables = "RNETMIG",
                               year = 2019,
                               geometry = TRUE,
                               resolution = "20m") %>%
  shift_geometry()

order = c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")

net_migration <- net_migration %>%
  mutate(groups = case_when(
    value > 15 ~ "+15 and up",
    value > 5 ~ "+5 to +15",
    value > -5 ~ "-5 to +5",
    value > -15 ~ "-15 to -5",
    TRUE ~ "-15 and below"
  )) %>%
  mutate(groups = factor(groups, levels = order))

state_overlay <- states(
  cb = TRUE,
  resolution = "20m"
) %>%
  filter(GEOID != "72") %>%
  shift_geometry()

ggplot() +
  geom_sf(data = net_migration, aes(fill = groups, color = groups), size = 0.1) + 
  scale_fill_brewer(palette = "PuOr", direction = -1) +
  scale_color_brewer(palette = "PuOr", direction = -1, guide = FALSE)  +
  labs(title = "Net migration per 1000 residents in CA",
       subtitle = "US Census Bureau 2019 Population Estimates",
       fill = "Rate") +
  theme_minimal(base_family = "Roboto")

We found LA county is losing population this 5 years. It requires further discussion how this trend will effect the unemployment rate.

Median Age

#median age
med_age <- get_acs(state = "CA", county = "Los Angeles", geography = "tract", 
                  variables = "B01002_001", geometry = TRUE)
med_age %>%
  ggplot(aes(fill = estimate)) + 
  geom_sf(color = NA) + 
  scale_fill_viridis_c(option = "magma")

Building Up Models

Modeltime model

model_data = read.csv("D:/UCSB/Spring_2022/PSTAT 131/PSTAT_131_HW/HW2/PSTAT-131/Final Project/data/processed_data.csv")

I first use data from 1990-2016 to test whether these models can only use previous values of the unemployment rate and date to forecast in a satisfactory accuracy. Further, I want to utilize models which consider other economics indicators. Hopefully, I can create models that can first perceive economics situation in the US and then determine the trend of the unemployment rate.

Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again. When forecasting time series data, the aim is to estimate how the sequence of observations will continue into the future. Therefore, the main concern may be only to predict what will happen, not to know why it happens.

data = model_data %>%
  select(DATE, unemploy_rate_la) %>% 
  mutate(DATE, DATE = as.Date.character(DATE))

data = data[1:320,] # I don't want to include pandemic 

data %>%plot_time_series(DATE, unemploy_rate_la)

Splitting the data set and creating the training and testing set

splits <- initial_time_split(data, prop = 0.9)

Add fitted models to a Model Table

models_tbl <- modeltime_table(
    model_fit_arima_no_boost,
    model_fit_arima_boosted,
    model_fit_ets,
    model_fit_prophet,
    model_fit_lm,
    wflw_fit_mars
)

models_tbl
## # Modeltime Table
## # A tibble: 6 x 3
##   .model_id .model     .model_desc                              
##       <int> <list>     <chr>                                    
## 1         1 <fit[+]>   ARIMA(4,0,1)(2,1,1)[12]                  
## 2         2 <fit[+]>   ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS
## 3         3 <fit[+]>   ETS(A,AD,A)                              
## 4         4 <fit[+]>   PROPHET                                  
## 5         5 <fit[+]>   LM                                       
## 6         6 <workflow> EARTH

Calibrate the model to a testing set

calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testing(splits))
calibration_tbl
## # Modeltime Table
## # A tibble: 6 x 5
##   .model_id .model     .model_desc                        .type .calibration_da~
##       <int> <list>     <chr>                              <chr> <list>          
## 1         1 <fit[+]>   ARIMA(4,0,1)(2,1,1)[12]            Test  <tibble>        
## 2         2 <fit[+]>   ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOS~ Test  <tibble>        
## 3         3 <fit[+]>   ETS(A,AD,A)                        Test  <tibble>        
## 4         4 <fit[+]>   PROPHET                            Test  <tibble>        
## 5         5 <fit[+]>   LM                                 Test  <tibble>        
## 6         6 <workflow> EARTH                              Test  <tibble>
calibration_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = data
    ) %>%
    plot_modeltime_forecast(
      .legend_max_width = 25
    )
calibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = FALSE
    )
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ARIMA(4,0,1)(2,1,1)[12] Test 0.73 11.89 2.19 11.09 0.85 0.64
2 ARIMA(2,0,2)(1,1,1)[12] W/ XGBOOST ERRORS Test 0.72 12.68 2.18 11.22 1.01 0.67
3 ETS(A,AD,A) Test 0.60 10.02 1.82 9.35 0.74 0.85
4 PROPHET Test 6.37 100.78 19.17 64.20 6.60 0.52
5 LM Test 2.35 38.82 7.07 30.58 2.67 0.00
6 EARTH Test 0.64 10.57 1.94 9.80 0.75 0.94

It seems that

  1. need to tune hyper-parameter
  2. need cross validation
  3. Analyze the ARIMA models
  1. Differencing and White Noise
  2. acf() and pacf()
  3. KPSS Unit Root Test
  1. need to explain intuition of these functions

Random Forest

Neural Network